Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
translated by 谷歌翻译
Named entity recognition models (NER), are widely used for identifying named entities (e.g., individuals, locations, and other information) in text documents. Machine learning based NER models are increasingly being applied in privacy-sensitive applications that need automatic and scalable identification of sensitive information to redact text for data sharing. In this paper, we study the setting when NER models are available as a black-box service for identifying sensitive information in user documents and show that these models are vulnerable to membership inference on their training datasets. With updated pre-trained NER models from spaCy, we demonstrate two distinct membership attacks on these models. Our first attack capitalizes on unintended memorization in the NER's underlying neural network, a phenomenon NNs are known to be vulnerable to. Our second attack leverages a timing side-channel to target NER models that maintain vocabularies constructed from the training data. We show that different functional paths of words within the training dataset in contrast to words not previously seen have measurable differences in execution time. Revealing membership status of training samples has clear privacy implications, e.g., in text redaction, sensitive words or phrases to be found and removed, are at risk of being detected in the training dataset. Our experimental evaluation includes the redaction of both password and health data, presenting both security risks and privacy/regulatory issues. This is exacerbated by results that show memorization with only a single phrase. We achieved 70% AUC in our first attack on a text redaction use-case. We also show overwhelming success in the timing attack with 99.23% AUC. Finally we discuss potential mitigation approaches to realize the safe use of NER models in light of the privacy and security implications of membership inference attacks.
translated by 谷歌翻译
3D对象检测是自动驾驶的重要组成部分,深层神经网络(DNNS)已达到此任务的最新性能。但是,深层模型臭名昭著,因为将高置信度得分分配给分布(OOD)输入,即未从训练分布中得出的输入。检测OOD输入是具有挑战性的,对于模型的安全部署至关重要。已经针对分类任务进行了广泛研究OOD检测,但是它尚未对对象检测任务,特别是基于激光雷达的3D对象检测的注意力。在本文中,我们关注基于激光雷达的3D对象检测的OOD输入的检测。我们制定了OOD输入对于对象检测的含义,并提议适应几种OOD检测方法进行对象检测。我们通过提出的特征提取方法来实现这一目标。为了评估OOD检测方法,我们开发了一种简单但有效的技术,用于为给定的对象检测模型生成OOD对象​​。我们基于KITTI数据集的评估表明,不同的OOD检测方法具有检测特定OOD对象​​的偏差。它强调了联合OOD检测方法的重要性以及在这个方向上进行更多研究。
translated by 谷歌翻译
数值验证是机器学习研究的核心,因为它允许评估新方法的实际影响,并确认理论和实践之间的一致性。然而,该领域的快速发展构成了一些挑战:研究人员面临着大量的方法来比较,有限的透明度和最佳实践的共识以及乏味的重新实施工作。结果,验证通常是非常部分的,这可能会导致错误的结论,从而减慢研究的进展。我们提出了Benchopt,这是一个协作框架,旨在在跨编程语言和硬件体系结构的机器学习中自动化,复制和发布优化基准。 Benchopt通过提供用于运行,共享和扩展实验的现成工具来简化社区的基准测试。为了展示其广泛的可用性,我们在三个标准学习任务上展示基准:$ \ ell_2 $ regulaine的逻辑回归,套索和RESNET18用于图像分类的培训。这些基准强调了关键的实际发现,这些发现对这些问题的最新问题更加细微,这表明在实际评估中,魔鬼在细节上。我们希望Benchopt能在社区中促进合作工作,从而改善研究结果的可重复性。
translated by 谷歌翻译
我们提出了一个逻辑框架,该框架正式建模给定数据库D上的给定私有信息P如何通过代理/对手反复查询数据库逐渐捕获。命名为DLTTS(分布式标记为标记的过渡系统),框架借用了几个领域的想法:Segala的概率自动机,概率并发系统和概率标记的过渡系统。 DLTTS上的每个节点都附加了一个标签,该标签代表了对手的“当前”知识,该标签是从DBMS对其查询的答案机制的回答中获得的,在任何给定的运行中,都在前面遍历的节点;这些知识以相同的节点完成,并进行进一步的关系扣除,可能与事先给出的其他数据库的“公共”信息结合使用。 “黑框”机制也是DLTTS的一部分,它是甲骨文的。它的作用是确定私人信息是否是由对手在当前节点上推导的,如果这样终止了运行。另一个特殊功能是,黑框还提供了有关“接近”或“远”的信息,从私人信息p,在当前节点上对对手的知识是如何的。为此目的定义了一个度量标准,从给定数据库的所有“类型兼容”元组的集合,数据本身与基数标题键入。尽管我们的框架具有过渡系统的风味,但在其他作品中提出的意义上,该指标并不是“行为”。它仅以数据库为导向,并允许在数据库之间定义新的邻接和indingishabilty的概念,比通常基于Hamming Metric(和邻接的受限概念)的数据库之间的不一致。一直提供示例以说明我们的框架的工作原理。关键字:数据库,隐私,过渡系统,概率,发行。
translated by 谷歌翻译
Here, we demonstrate how machine learning enables the prediction of comonomers reactivity ratios based on the molecular structure of monomers. We combined multi-task learning, multi-inputs, and Graph Attention Network to build a model capable of predicting reactivity ratios based on the monomers chemical structures.
translated by 谷歌翻译
The recent increase in public and academic interest in preserving biodiversity has led to the growth of the field of conservation technology. This field involves designing and constructing tools that utilize technology to aid in the conservation of wildlife. In this article, we will use case studies to demonstrate the importance of designing conservation tools with human-wildlife interaction in mind and provide a framework for creating successful tools. These case studies include a range of complexities, from simple cat collars to machine learning and game theory methodologies. Our goal is to introduce and inform current and future researchers in the field of conservation technology and provide references for educating the next generation of conservation technologists. Conservation technology not only has the potential to benefit biodiversity but also has broader impacts on fields such as sustainability and environmental protection. By using innovative technologies to address conservation challenges, we can find more effective and efficient solutions to protect and preserve our planet's resources.
translated by 谷歌翻译
We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key steps extraction. We employ self-supervised representation learning via a training strategy that adapts off-the-shelf video features using a temporal module. Training implements self-supervised learning losses involving multiple cues such as appearance, motion and pose trajectories extracted from videos to learn generalizable representations. Our method extracts key steps via a tunable algorithm that clusters the representations extracted from procedural videos. We quantitatively evaluate our approach with key step localization and also demonstrate the effectiveness of the extracted representations on related downstream tasks like phase classification. Qualitative results demonstrate that the extracted key steps are meaningful to succinctly represent the procedural tasks.
translated by 谷歌翻译
We introduce Argoverse 2 (AV2) - a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras, and two stereo cameras in addition to lidar point clouds, and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently-sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with the prediction of future motion for "scored actors" in each scenario and are provided with track histories that capture object location, heading, velocity, and category. In all three datasets, each scenario contains its own HD Map with 3D lane and crosswalk geometry - sourced from data captured in six distinct cities. We believe these datasets will support new and existing machine learning research problems in ways that existing datasets do not. All datasets are released under the CC BY-NC-SA 4.0 license.
translated by 谷歌翻译
Modern deep neural networks have achieved superhuman performance in tasks from image classification to game play. Surprisingly, these various complex systems with massive amounts of parameters exhibit the same remarkable structural properties in their last-layer features and classifiers across canonical datasets. This phenomenon is known as "Neural Collapse," and it was discovered empirically by Papyan et al. \cite{Papyan20}. Recent papers have theoretically shown the global solutions to the training network problem under a simplified "unconstrained feature model" exhibiting this phenomenon. We take a step further and prove the Neural Collapse occurrence for deep linear network for the popular mean squared error (MSE) and cross entropy (CE) loss. Furthermore, we extend our research to imbalanced data for MSE loss and present the first geometric analysis for Neural Collapse under this setting.
translated by 谷歌翻译